Abstract. In this work we show that the classification performance on
high-dimensional structural MRI data with only a small set of training
examples is improved by dimension reduction methods. We assessed two
dimension reduction variants: feature selection by ANOVA F-test and feature
transformation by PCA. On the reduced datasets, we applied common learning
algorithms using 5-fold cross-validation. Training, hyperparameter tuning,
and performance evaluation of the classifiers were conducted using two
performance measures: accuracy and the area under the Receiver Operating
Characteristic curve (AUC). Our hypothesis is supported by experimental
results.


1 Introduction 

Machine learning algorithms are used in various fields to learn patterns from
data and to make predictions on unseen data. Unfortunately, the probability
that a learning algorithm overfits increases with the number of features [14].
Dimension reduction methods are not only powerful tools to avoid overfitting,
but also make training on high-dimensional data computationally more feasible.
In this work, we study the influence of dimension reduction techniques on the
performance of various well-known classification methods. Dimension reduction
methods are categorized into two groups: feature selection and feature
transformation.

Feature selection methods aim to identify a subset of “meaningful” features
out of the original set of features. They can be subdivided into filter,
wrapper and embedded methods. Filter methods compute a score for each feature
and then select only the features with the best scores. Wrapper methods train
a predictive model on subsets of features and select the subset with the best
score. The search for subsets can be done either in a deterministic (e.g.
forward selection, backward elimination) or a random (e.g. genetic algorithms)
way. Embedded methods determine the optimal subset of features directly from
the trained weights of the classification method.

In contrast to feature selection methods, feature transformation methods
project the original high-dimensional data into a lower-dimensional space.
Principal Component Analysis (PCA) is one of the best-known techniques in this
category. PCA finds the principal axes in the dataset that explain most of the
variance, without considering the class labels. Therefore we use PCA as the
baseline for dimension reduction methods in this study.

Among the various feature selection methods, we limit our scope to filter
methods, as they do not depend on a specific classification method and are
therefore suitable for the comparison of different classifiers. Comparative
studies on text classification have revealed that univariate statistical tests
like the χ²-test and the ANOVA F-test are among the most effective scores for
filter-based feature selection. As the χ²-test is only applicable to
categorical data, we use the filter selection method based on the ANOVA F-test,
which is applicable to the continuous features used in this study.

As part of the 17th International Conference on Medical Image Computing
and Computer Assisted Intervention (MICCAI), the MICCAI 2014 Machine
Learning Challenge (MLC) aims at an objective comparison of the latest machine
learning algorithms applied to structural MRI data. The subtask of binary
classification of clinical phenotypes is particularly challenging: in the
opinion of the challenge authors, a prediction accuracy of 0.6 is acceptable.
Motivated by this challenge, the goal of this study is to show that the
selected dimension reduction methods improve the performance of various
classifiers trained on a small set of high-dimensional structural MRI data.

The report is organized as follows. Section 2 describes the dimension
reduction methods. Section 3 describes the classifiers. Section 4 describes
the datasets, evaluation measures and the experiment methodology. Section 5
presents the results. Section 6 discusses the major findings. Section 7
summarizes the conclusion.


2 Dimension reduction 


In the following we give an overview of the techniques used for dimension
reduction.


Filtered feature selection by ANOVA F-test. Feature selection methods based on
filtering determine the relevance of features by calculating a score (usually
based on a statistical measure or test). Given a number of selected features s,
only the s top-scoring features are forwarded to the classification algorithm.
In this study, we use the ANOVA F-test statistic [5] for the feature scoring.
The F-test score assesses whether the expected values of a quantitative random
variable x within m predefined groups differ from each other. The F-value is
defined as

$$F = \frac{MS_B}{MS_W},$$

where $MS_B$ reflects the “between-group variability”, expressed as

$$MS_B = \frac{\sum_{i=1}^{m} n_i (\bar{x}_i - \bar{x})^2}{m - 1},$$

where $n_i$ is the number of observations in the i-th group, $\bar{x}_i$
denotes the sample mean in the i-th group, and $\bar{x}$ denotes the overall
mean of the data. $MS_W$ refers to the “within-group variability”, defined as

$$MS_W = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}{n - m},$$

where $x_{ij}$ denotes the j-th observation in the i-th group and n is the
total number of observations. For the binary classification problem assessed
in this report, the number of groups is m = 2.
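
The scoring and selection steps above can be sketched in a few lines of plain
Python. This is a minimal illustration, not the code used in the study; the
function names `anova_f` and `select_top_s` and the toy data are assumptions
made for the example.

```python
def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature across the label groups."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, m = len(values), len(groups)
    overall_mean = sum(values) / n
    # Between-group variability MS_B
    ms_b = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2
               for g in groups.values()) / (m - 1)
    # Within-group variability MS_W
    ms_w = sum((x - sum(g) / len(g)) ** 2
               for g in groups.values() for x in g) / (n - m)
    return ms_b / ms_w


def select_top_s(X, y, s):
    """Indices of the s features with the highest F-value."""
    n_features = len(X[0])
    scores = [anova_f([row[k] for row in X], y) for k in range(n_features)]
    return sorted(range(n_features), key=lambda k: scores[k], reverse=True)[:s]


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [7.0, 4.0], [8.0, 5.0], [9.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
f0 = anova_f([row[0] for row in X], y)
f1 = anova_f([row[1] for row in X], y)
top = select_top_s(X, y, 1)
```

For the toy data, the class means of feature 0 differ strongly, so its F-value
is large, while the class means of feature 1 coincide and its F-value is zero.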


Feature transformation by PCA. PCA reduces the dimension of the data
by finding the first s orthogonal linear combinations of the original variables
with the largest variance. PCA is defined in such a way that the first principal
component has the largest possible variance. Each succeeding component in
turn has the highest variance possible under the constraint that it is orthogonal
to the preceding components.
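
One common way to obtain such components is the singular value decomposition
of the centered data matrix. The following NumPy sketch illustrates this under
that assumption; the helper name `pca_reduce` and the random data are
hypothetical and do not reproduce the study's implementation.

```python
import numpy as np


def pca_reduce(X, s):
    """Project X (samples x features) onto its first s principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal principal axes, sorted by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    return X_centered @ Vt[:s].T, explained_variance[:s]


rng = np.random.default_rng(0)
# Hypothetical data: ten columns with decreasing spread.
X = rng.normal(size=(150, 10)) * np.arange(10, 0, -1)
Z, var = pca_reduce(X, 3)
```

The projected components in `Z` are mutually uncorrelated, and their variances
in `var` are sorted in decreasing order, which matches the definition above.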

3 Classifiers


In the following we give a short description of each classification method
used in this study.

k-Nearest Neighbors (k-NN). The k-NN classifier does not train a specific
model, but stores a labeled training set as a “reference set”. The class of a
sample is then determined as the class with the most representatives among
the k nearest neighbors of the sample in the reference set. Odd values of k
prevent tie votes in the two-class case. Among the possible metrics, we use
the Euclidean distance in this study.
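
The voting rule can be sketched as follows. This is a minimal pure-Python
illustration with Euclidean distance; the function name `knn_predict` and the
toy reference set are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(reference_X, reference_y, sample, k=3):
    """Majority vote among the k reference points closest to `sample`."""
    distances = sorted(
        (math.dist(row, sample), label)
        for row, label in zip(reference_X, reference_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reference set with two well-separated classes.
reference_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
reference_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(reference_X, reference_y, [0.5, 0.5], k=3)
near_corner = knn_predict(reference_X, reference_y, [5.5, 5.5], k=3)
```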

Gaussian Naive Bayes (GNB). Bayes classifiers are based on Bayes' theorem
and depend on naive (strong) independence assumptions. Using Bayes' theorem,
the probability $P(\omega_j \mid \mathbf{x})$ of some class $\omega_j$ given a
d-dimensional random feature vector $\mathbf{x} \in \mathbb{R}^d$ can be
expressed as

$$P(\omega_j \mid \mathbf{x}) = \frac{P(\omega_j)\, p(\mathbf{x} \mid \omega_j)}{p(\mathbf{x})},$$

where $P(\omega_j)$ denotes the prior probability of the j-th class $\omega_j$,
$p(\mathbf{x} \mid \omega_j)$ refers to the class conditional probability
density function for $\mathbf{x}$ given class $\omega_j$, and $p(\mathbf{x})$
is the evidence factor used for scaling, which in the case of two classes is
defined as

$$p(\mathbf{x}) = \sum_{j=1}^{2} P(\omega_j)\, p(\mathbf{x} \mid \omega_j).$$

Under the naive assumption that all the individual components $x_i$,
$i = 1, \dots, d$, of $\mathbf{x}$ are conditionally independent given the
class, $p(\mathbf{x} \mid \omega_j)$ can be decomposed into the product
$p(x_1 \mid \omega_j) \cdots p(x_d \mid \omega_j)$. Therefore we can rearrange
$P(\omega_j \mid \mathbf{x})$ as

$$P(\omega_j \mid \mathbf{x}) = \frac{P(\omega_j) \prod_{i=1}^{d} p(x_i \mid \omega_j)}{p(\mathbf{x})}.$$

Since $p(\mathbf{x})$ is constant for a given input, the naive Bayes
classifier predicts the class $\omega_k$ that maximizes the numerator:

$$\omega_k = \arg\max_{j} \; P(\omega_j) \prod_{i=1}^{d} p(x_i \mid \omega_j).$$

Under the typical assumption that continuous values associated with each class
are Gaussian distributed, the probability density of a component $x_i$ given a
class $\omega_j$ can be expressed as

$$p(x_i \mid \omega_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right),$$

where $\mu_{ij}$ denotes the class conditional mean and $\sigma_{ij}^2$ the
class conditional variance. The corresponding classifier is called Gaussian
Naive Bayes.
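
The decision rule can be sketched in plain Python as shown below. The sketch
makes two assumptions that are not taken from the study: it maximizes the sum
of log-probabilities instead of the product (same maximizer, numerically more
stable), and it uses the maximum-likelihood variance estimate. The names
`fit_gnb` and `predict_gnb` and the toy data are hypothetical.

```python
import math


def fit_gnb(X, y):
    """Estimate the prior, per-feature mean and variance for every class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        # Maximum-likelihood variance (divides by the class size).
        variances = [sum((v - m) ** 2 for v in col) / len(rows)
                     for col, m in zip(zip(*rows), means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict_gnb(model, x):
    """Return the class maximizing log P(w_j) + sum_i log p(x_i | w_j)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
            for xi, mu, var in zip(x, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)


# Hypothetical two-class toy data with two features.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.8, 4.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
pred_a = predict_gnb(model, [1.1, 2.0])
pred_b = predict_gnb(model, [2.9, 4.0])
```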


Linear Discriminant Analysis (LDA). Given a two-class problem, the LDA
algorithm finds a separating hyperplane that maximizes the distance between
the projected class means while minimizing the variance within each class.
LDA is based on the assumption that both classes are normally distributed and
share the same covariance matrix.

Ridge. The Ridge classifier is based on Ridge Regression, which extends the
Ordinary Least Squares (OLS) method with an additional penalty term that limits
the $L_2$-norm of the weight vector. This penalty term shrinks the weights
to prevent overfitting.
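
The shrinkage effect of the penalty can be illustrated with the closed-form
ridge solution. The NumPy sketch below assumes class labels encoded as -1/+1
and an intercept that is excluded from the penalty by centering; these choices,
the function names and the synthetic data are assumptions for illustration and
need not match the library implementation used in the study.

```python
import numpy as np


def fit_ridge_classifier(X, y, alpha=1.0):
    """Ridge regression on labels encoded as -1/+1; returns weights, intercept."""
    targets = np.where(y == 1, 1.0, -1.0)
    x_mean, t_mean = X.mean(axis=0), targets.mean()
    Xc = X - x_mean  # centering keeps the intercept out of the penalty
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]),
                              Xc.T @ (targets - t_mean))
    return weights, t_mean - x_mean @ weights


def predict_ridge_classifier(X, weights, intercept):
    """Predict class 1 where the regression output is positive, else class 0."""
    return (X @ weights + intercept > 0).astype(int)


rng = np.random.default_rng(0)
# Hypothetical data: two Gaussian blobs in five dimensions.
X = np.vstack([rng.normal(-2.0, 1.0, size=(40, 5)),
               rng.normal(2.0, 1.0, size=(40, 5))])
y = np.array([0] * 40 + [1] * 40)
w_small, b_small = fit_ridge_classifier(X, y, alpha=0.1)
w_large, b_large = fit_ridge_classifier(X, y, alpha=1000.0)
train_accuracy = (predict_ridge_classifier(X, w_small, b_small) == y).mean()
```

A larger `alpha` yields a weight vector with a smaller norm, which is the
shrinkage described above.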

Support Vector Machine (SVM). A support vector machine [21, p. 325] solves
the classification of a dataset by constructing a hyperplane in a high- or
infinite-dimensional space in such a way that the distance of the nearest
points of the training data to the hyperplane is maximized. The idea of the
large margin is to ensure that samples which are not exactly equal to the
training data can still be classified reliably. To prevent overfitting by
permitting some degree of misclassification, a cost parameter C controls the
trade-off between allowing training errors and forcing rigid margins.
Increasing the value of C increases the cost of misclassifying points and
forces the creation of a more accurate model that may not generalize well. For
our experiments we use an SVM classifier with a linear kernel (SVM-L) as well
as an SVM classifier with a non-linear kernel using radial basis functions
(SVM-RBF).

Random Forests (RF). Bagging predictors [3] generate multiple versions of a
predictor (in this case decision trees), which are combined into an aggregated
predictor. By generating a set of trees in randomly selected subspaces of the
feature space, the different trees generalize their classification in
complementary ways.

4 Experiment settings 

In the following section we describe the conducted experiments in detail. We
used the machine learning library scikit-learn, version 0.14.1, for all
proposed methods and scoring measures. This open source Python library
provides a wide variety of machine learning algorithms based on a consistent
interface, which eases the comparison of different methods for a given task.

4.1 Dataset 

In our experiments, we used the dataset for the binary classification task of
the MLC 2014 [13]. This dataset consists of 250 T1-weighted structural brain
MRI scans: 150 scans including the target class labels for training and an
additional 100 scans with unknown class labels, reserved for the challenge
submission. For each scan, 184 morphological summary features are provided.
These features represent volumes of cortical and sub-cortical structures, as
well as average thickness measurements within cortical regions. The volume
measures have been normalized with the intracranial volume (ICV) to account
for different head sizes. All features have been extracted using the brain MRI
software FreeSurfer.

4.2 Evaluation measures 

As recommended by the MLC 2014 challenge, we used two common performance
measures: accuracy and the area under the Receiver Operating Characteristic
curve (AUC). Both compare the predictions of the classifier with the ground
truth provided in the training data.

^http://scikit-learn.org 



OAGM Workshop 2015 (arXiv:1505.01065) 




Accuracy. The accuracy is defined as

$$\text{accuracy} = \frac{tp + tn}{tp + fp + tn + fn},$$

where tp, tn, fp and fn denote the number of true positives, true negatives,
false positives and false negatives, respectively.
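
As a small worked example, the four counts and the resulting accuracy can be
computed as follows. The helper names and the label vectors are hypothetical,
and class 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)  # (3, 3, 1, 1)
acc = accuracy(y_true, y_pred)             # 6 correct out of 8 -> 0.75
```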


Area under the ROC curve (AUC). The ROC curve presents the trade-off between
the true positive rate (TPR), expressed as

$$TPR = \frac{tp}{tp + fn},$$

and the false positive rate (FPR), defined as

$$FPR = \frac{fp}{fp + tn}.$$

Given a two-class problem, a ROC curve can be plotted by varying the
probability threshold for predicting positive examples in the interval between
zero and one. Informally, one point in ROC space is better than another if it
lies to the northwest of the other (the TPR is higher, the FPR is lower, or
both). Hence the curve visualizes in which regions one model is superior to
another. The area under the ROC curve (AUC) maps this relation to a single
value.
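
The threshold sweep and the area computation can be sketched in plain Python
as follows. The trapezoidal rule is one common way to compute the area and is
an assumption of this sketch; the function names and the scores are
hypothetical.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold stepwise."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, scores))
        fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)  # 8 of the 9 positive/negative pairs are ranked correctly
```

In this example one negative sample (score 0.7) is ranked above one positive
sample (score 0.6), so the area is 8/9 rather than 1.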


4.3 5-Fold cross-validation (CV) 

We used 5-fold CV by randomly splitting the training dataset $D$ of 150 samples
into five mutually exclusive subsets $(D_1, D_2, D_3, D_4, D_5)$ of
approximately equal size. Each classification model was trained and tested
five times, where each time $t \in \{1, 2, 3, 4, 5\}$ it was trained on all
except one fold ($D \setminus D_t$) and tested on the remaining fold ($D_t$).
The accuracy and AUC measures were averaged over the five individual test
folds.
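
The splitting and averaging scheme can be sketched as follows. The helper
names, the fixed random seed and the placeholder scoring function are
assumptions made for the example; a real run would train and evaluate a
classifier inside `fit_and_score`.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, fit_and_score, k=5, seed=0):
    """Average fit_and_score(train_idx, test_idx) over the k test folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for t in range(k):
        test_idx = folds[t]
        train_idx = [i for f, fold in enumerate(folds) if f != t for i in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k


folds = k_fold_indices(150, k=5)
# Placeholder scorer: reports the fraction of samples used for training.
mean_score = cross_validate(150, lambda train, test: len(train) / 150)
```

With 150 samples, each fold holds 30 samples, so every model is trained on 120
samples and tested on the remaining 30.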


4.4 Experiment Methodology 

Our experiments were conducted in the following way. We applied each dimension
reduction method to the original training set with a varying number of
$s \in \{3, 6, 12, 24, 48, 92, 184\}$ selected components. We trained the
classifiers on the 150 samples with known target class labels using 5-fold CV
in two ways: first by optimizing the accuracy measure and second by optimizing
the AUC measure. For classifiers with hyperparameters, we performed a grid
search to find the optimal configuration. As an exhaustive search over all
possible hyperparameters would be infeasible, we limited our scope to a subset
of hyperparameters for each classifier with a discrete set of tested values.
Table 1 shows the selected hyperparameters and the corresponding set of values
for each classifier.
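
A procedure of this kind could be expressed with scikit-learn roughly as
follows. This is a hedged sketch, not the study's code: it assumes a current
scikit-learn release (the module layout differs from version 0.14.1), uses
synthetic data in place of the challenge dataset, and uses illustrative grid
values. It also places the ANOVA filter inside the pipeline, so the filter is
refitted on each training fold, which is a design choice of this sketch and
may differ from the procedure described above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the training data (150 x 184).
X, y = make_classification(n_samples=150, n_features=184, n_informative=10,
                           random_state=0)

pipeline = Pipeline([
    ("reduce", SelectKBest(score_func=f_classif)),  # ANOVA F-test filter
    ("clf", SVC(kernel="rbf")),
])
# Illustrative grid; the values are assumptions, not those of Table 1.
param_grid = {
    "reduce__k": [3, 12, 48],
    "clf__C": [1.0, 10.0],
    "clf__gamma": [1e-3, 1e-2],
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Tune once per performance measure, as in the two experiment variants.
results = {}
for scoring in ("accuracy", "roc_auc"):
    search = GridSearchCV(pipeline, param_grid, scoring=scoring, cv=cv)
    search.fit(X, y)
    results[scoring] = (search.best_params_, search.best_score_)
```

Replacing the `reduce` step with `sklearn.decomposition.PCA` and the grid key
`reduce__k` with `reduce__n_components` would give the PCA variant of the same
sketch.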


5 Results 


Fig. 1 shows the performance of the classifiers on the basis of ANOVA F-test
feature selection using accuracy (Fig. 1a) and AUC (Fig. 1b) for hyperparameter
tuning and performance evaluation, respectively. Both figures reveal that, with
s = 12 selected features, the classifiers already achieve the same or better
performance than with the original s = 184 features. When the number of
selected features is further increased, the performance of SVM-RBF peaks at
s = 92, while the performance of the other classifiers does not improve or even
declines. This observation shows the importance of feature selection, as more
features do not necessarily lead to better performance (overfitting).

[Figure 1, two panels. x-axis: number of selected features. Legend: GNB, KNN,
LDA, RF, Ridge, SVM-RBF, SVM.]
(a) Hyperparameter tuning and evaluation using accuracy
(b) Hyperparameter tuning and evaluation using AUC

Figure 1: Classifier performances using ANOVA-based feature selection.

Fig. 2 displays the classifier performance on the basis of PCA-reduced data
using accuracy (Fig. 2a) and AUC (Fig. 2b) for hyperparameter tuning and
performance evaluation, respectively. Both figures show that the performances
of SVM-RBF and KNN are largely independent of the number of components used,
with the difference that SVM-RBF outperforms the other classifiers, while KNN
exhibits a constantly weak performance over all numbers of components. The
other classifiers already perform better on the first s = 3 components of the
PCA than on the original features. When the number of components s is further
increased, these classifiers show a common performance breakdown at s = 12.
Increasing s beyond that leads to the performance maximum at s = 24
components.

In this study we additionally observe that the GNB classifier performs better
than the LDA classifier for s > 12 selected components, although both methods
assume normally distributed features within each class. The key difference is
that GNB treats the features as conditionally independent, whereas LDA
additionally considers the covariance of the dimensions. When the number of
samples is lower than the number of dimensions, as is the case in this study,
an accurate estimation of the covariance matrix cannot be guaranteed. This
phenomenon is known as the “small sample size” problem. This observation
suggests that, due to its simpler assumptions, GNB is a more robust
classification method than LDA, given a small training set.


Table 1: The selected hyperparameters and corresponding values for
hyperparameter optimization using grid search.

Classifier   Tuned hyperparameters
KNN          k ∈ {3, 5, 7, 9, 11, 13, 15}
Ridge        α ∈ {0.1, 1, 10}
SVM          C ∈ {10^0, 10^1, ..., 10^®}
SVM-RBF      γ ∈ {10^-10, 10^-®, ..., 10^}
             C ∈ {10^0, 10^1, ..., 10^®}
RF           N_trees ∈ {2, 4, 8, 16, 32}, with number of trees N_trees


[Figure 2, two panels. x-axis: number of selected features. Legend: GNB, KNN,
LDA, RF, Ridge, SVM-RBF, SVM.]
(a) Hyperparameter tuning and evaluation using accuracy
(b) Hyperparameter tuning and evaluation using AUC

Figure 2: Classifier performances using PCA-based dimension reduction.

6 Discussion 

The performances of the majority of the investigated classifiers converge
consistently at the same number s of selected features, independent of the
measure used for the tuning of hyperparameters. This indicates that the search
for the optimal number of selected features is a robust way to improve the
performance of classifiers given high-dimensional data. The results confirm
that the SVM-RBF classifier outperforms the other classifiers largely
independent of the number of reduced features. However, the results also show
that linear classifiers like GNB and Ridge are able to produce equal or even
better results than the SVM-RBF classifier on reduced dimensions using the
chosen feature selection methods.

7 Conclusion 

The performances of classifiers under various scores for hyperparameter tuning
combined with different dimension reduction methods were analyzed. Both
dimension reduction methods improved the performance of all classifiers in
comparison to the original high-dimensional data. The results indicate that
ANOVA F-test feature selection yielded better results than the PCA-based
feature transformation.